LCCM Code Walk-through Example

The following notebook demonstrates how this latent class choice model code works. We will be using an example dataset (Qualtrics data long format.csv) to perform model specification, estimation and sample enumeration. Our objective is to understand mode choice decisions across different latent classes.

The dataset is a carsharing dataset collected by a UC Berkeley student as part of their research. The file, "Qualtrics data long format.csv", can be found in the GitHub repository. More detail on the dataset and the variables it contains can be found in the repository as well (QualtricsData.pdf).

The data were obtained by surveying Bay Area private transit users to understand whether users of these services have different modality styles (unobserved latent lifestyles that dictate the set of modes they consider when making trips and mode choice decisions). Each respondent provided a travel diary for an entire day indicating the travel time of the available modes, the incurred travel cost, and other information. Socio-demographic variables were also obtained via the survey.

The overall set of possible choices in this dataset was enumerated from 1 to 6 to denote "Auto", "Walk", "Bike", "Walk to Transit", "Drive to Transit", and "Private Transit" alternatives.

Throughout this example, we will highlight the various steps required to run the lccm package.

Step 1: Import the LCCM Package and All other Necessary Packages


In [2]:
import lccm
import numpy as np
import pandas as pd
import pylogit
import warnings
from collections import OrderedDict

Step 2: Load the Dataset.

The lccm package handles data in long format (NOT wide format). To see the difference between the two formats, run the following lines of code (make sure the data files are in the same directory as your code file):

df_long = pd.read_csv('Qualtrics data long format.csv')

df_long.head()

df_wide = pd.read_excel('Qualtrics data wide format.xlsx')

df_wide.head()

If you are not sure how to convert wide-format data to long format yourself, the following link provides instructions on how to do the conversion: https://github.com/timothyb0912/pylogit/blob/master/examples/notebooks/Main%20PyLogit%20Example.ipynb
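As a quick illustration of the two layouts (this toy example is not part of the dataset; the column names and values are made up), the same single choice situation with two alternatives can be written in wide format, with one row per observation, or in long format, with one row per observation-alternative pair. The long format below is the layout lccm expects.

# Wide format: one row per choice situation, alternative attributes spread across columns
toy_wide = pd.DataFrame({'obs_id': [1],
                         'choice': [2],        # id of the chosen alternative
                         'tt_auto': [10.0],    # travel time if driving
                         'tt_walk': [25.0]})   # travel time if walking

# Long format: one row per observation-alternative pair, the choice marked with a 0/1 flag
toy_long = pd.DataFrame({'obs_id': [1, 1],
                         'alt_id': [1, 2],
                         'choice': [0, 1],
                         'travel_time': [10.0, 25.0]})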


In [3]:
# Load the data file
inputFilePath = 'C:/Users/Feras/Desktop/LCCM github exmaple/'
inputFileName = 'Qualtrics data long format.csv'

df = pd.read_csv(inputFilePath + inputFileName)

# Let's have a quick look at the long format data (first 5 rows)
df.head()


Out[3]:
   custom_id  mode_id  choice  ID  tripID  toll  distance  Age  Gender  bike_ownership  Hhsize  HHIncome  car_ownership  travel_time  travel_cost
0          1        1       0   1       1     0       0.8   31       1               1       2        17              2       0.0627       0.1696
1          1        2       1   1       1     0       0.8   31       1               1       2        17              2       0.2667       0.0000
2          1        3       0   1       1     0       0.8   31       1               1       2        17              2       0.0667       0.0000
3          1        4       0   1       1     0       0.8   31       1               1       2        17              2       0.6207       1.4100
4          1        5       0   1       1     0       0.8   31       1               1       2        17              2       0.4450       1.4100
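For reference, the mode_id values in the data correspond to the alternatives listed in the introduction. A small lookup dictionary (purely for readability; it is not required by lccm and is not part of the original notebook) makes later results easier to read:

# Mapping of mode_id codes to alternative names, based on the dataset description above
mode_labels = {1: 'Auto', 2: 'Walk', 3: 'Bike', 4: 'Walk to Transit',
               5: 'Drive to Transit', 6: 'Private Transit'}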

Step 3: Model Specification

3.1- Tips for running the latent class choice model:
  • 1) Latent class models are mixture models and have several local maxima, so their estimation is not straightforward.
  • 2) Begin with a simple two-class model by randomly initializing the starting values of the parameters.
  • 3) Run the model several times until you get reasonable estimates.
  • 4) Store these parameter estimates and add another class to the model.
  • 5) Use the parameter estimates stored earlier as starting values for the first two classes in the new model, and randomize the starting values of the parameter estimates for the new latent class.
  • 6) Repeat steps 3, 4, and 5 as you increase the number of latent classes in the model specification (a schematic sketch of this warm-start procedure follows this list).
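The following is a schematic sketch of the warm-start idea in tips 4-6. It is not part of the original notebook: the stored arrays and the random draw are placeholders, and the length of the new array must match the class-specific specification of the added class.

# Suppose a converged 2-class model produced one parameter array per class
# (the zeros below are placeholders, not real estimates).
stored_estimates_2classes = [np.zeros(7), np.zeros(7)]

# Starting values for a 3-class model: reuse the stored arrays for the first
# two classes and randomize the starting values of the newly added class.
np.random.seed(1)
paramClassSpec_3classes = [est.copy() for est in stored_estimates_2classes]
paramClassSpec_3classes.append(np.random.uniform(low=-1, high=1, size=7))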

In [4]:
# Setting the number of classes in the model to 2
n_classes = 2

Modify variables, create dummy variables, and scale variables so the estimated coefficients are of similar magnitudes.


In [5]:
# Create a dummy variable indicating whether a certain individual in the sample is male or female
df['male'] = (df['Gender']==1).astype(int)

# Create a categorical variable to indicate the income level of the household corresponding to each individual
# in the sample. We will define three categories: low income (income category  < 6) , medium income
# (6 <= income category < 12) and high income (12 <= income category < 16). 
df['low_income'] = (df['HHIncome']<6).astype(int)
df['medium_income'] = ((df['HHIncome']>=6)&(df['HHIncome']<12)).astype(int)
df['high_income'] = ((df['HHIncome']>=12)&(df['HHIncome']<16)).astype(int)
3.2- Membership Model:

The specification of the class membership model used in this example is as follows:

$V_{class1} = 0 $

$V_{class2} = ASC_{class2} + \beta_{CarOwnership, class2} * CarOwnership + \beta_{LowIncome, class2} * LowIncome + \beta_{HighIncome, class2} * HighIncome + \beta_{Male, class2} * Male + \beta_{DistanceTraveled, class2} * DistanceTraveled$


In [6]:
# NOTE: Specification and variable names must be in list data structures.
# class_membership_spec defines the variables names to be used in the specification of the 
# class membership model.
# class_membership_labels defines the names associated with each of the variables which
# will be displayed in the output table after estimation.

# NOTE: by specifying the word 'intercept' in the class_membership_spec, the code
# understands that this denotes an alternative specific constant.

# The class membership model constrains the utility of the first class to zero by default.
# The same socio-demographic variables will be included in the class membership
# model of each of the remaining classes (excluding the first class, as it is the base).
class_membership_spec = ['intercept', 'car_ownership','low_income','high_income',
                         'male','distance']
class_membership_labels = ['Class-specific constant', 'Car ownership', 'Low Income','High Income', 
                           'Male','Distance Traveled (miles)']
3.3- Defining the choice set for each latent class

In [7]:
# Set the available alternatives for each latent class
# Each array lists the alternatives available in the choice set of the corresponding latent class
# NOTE: By default the code does not require the user to specify the choice set
#       for each class. The code assumes that all alternatives are available in the choice
#       set for each latent class.
# We are assuming that the bike alternative does not exist in the choice set for latent class 1
# We are assuming that all alternatives are available for latent class 2
avail_alts = (np.array([1,2,4,5,6]),
              np.array([1,2,3,4,5,6]))
3.4- Class-specific Choice Model:

You can specify parameters as generic or alternative-specific. The example below includes both types of specification to show the modeler how each one is written.

In this example, the intercepts are alternative-specific, which is done with a single (non-nested) list, i.e. 'intercept', [2,3,4,5,6]. The first alternative is the base alternative, so no intercept enters its utility.

Note that we also constrain the choice set of latent class 1, for which the specification is 'intercept', [2,4,5,6]. The bike alternative is not in this class's choice set, so no parameters, including its ASC, are estimated for it.

Travel time parameters are generic across all alternatives for both latent classes. This is done with a nested list, i.e. 'travel_time', [[1,2,3,4,5,6]], as in the specification below. Note that for latent class 1 we drop travel time for alternative 3 (bike), since that alternative is not in its choice set.

For latent class 1, the travel cost parameter is constrained to be the same for the auto and drive-to-transit alternatives, and a second travel cost parameter is shared by the walk-to-transit and private transit alternatives. This is written as 'travel_cost', [[1,5],[4,6]] in the specification below.

For latent class 2, the travel cost parameter is generic across all alternatives that include a cost term.
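To recap the bracket convention in code form (this snippet is purely illustrative, is not part of the original notebook, and is not needed for estimation; it simply mirrors the kinds of entries used in the specification cell below):

# A plain (non-nested) list is alternative-specific: one coefficient per listed alternative.
alt_specific_intercepts = ('intercept', [2, 3, 4, 5, 6])       # five separate ASCs

# A nested list is generic/constrained: one coefficient shared within each inner list.
generic_travel_time = ('travel_time', [[1, 2, 3, 4, 5, 6]])    # one shared travel time coefficient
grouped_travel_cost = ('travel_cost', [[1, 5], [4, 6]])        # two coefficients, each shared by a pair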

The specification of the class specific choice model used in this example is as follows:

Latent Class 1:

$V_{auto} = \beta_{tt, class1} * TravelTime_{auto} + \beta_{cost\_Auto-DrivetoTransit, class1} * TravelCost_{auto} $

$V_{walk} = ASCWalk_{class1} + \beta_{tt, class1} * TravelTime_{walk}$

$V_{WalkToTransit} = ASCWalkToTransit_{class1} + \beta_{tt, class1} * TravelTime_{walktotransit} + \beta_{cost\_WalktoTransit-PrivateTransit, class1} * TravelCost_{walktotransit} $

$V_{DriveToTransit} = ASCDriveToTransit_{class1} + \beta_{tt, class1} * TravelTime_{drivetotransit} + \beta_{cost\_Auto-DrivetoTransit, class1} * TravelCost_{drivetotransit} $

$V_{PrivateTransit} = ASCPrivateTransit_{class1} + \beta_{tt, class1} * TravelTime_{privatetransit} + \beta_{cost\_WalktoTransit-PrivateTransit, class1} * TravelCost_{privatetransit} $

Latent Class 2:

$V_{auto} = \beta_{tt, class2} * TravelTime_{auto} + \beta_{cost, class2} * TravelCost_{auto} $

$V_{walk} = ASCWalk_{class2} + \beta_{tt, class2} * TravelTime_{walk}$

$V_{bike} = ASCBike_{class2} + \beta_{tt, class2} * TravelTime_{bike}$

$V_{WalkToTransit} = ASCWalkToTransit_{class2} + \beta_{tt, class2} * TravelTime_{walktotransit} + \beta_{cost, class2} * TravelCost_{walktotransit} $

$V_{DriveToTransit} = ASCDriveToTransit_{class2} + \beta_{tt, class2} * TravelTime_{drivetotransit} + \beta_{cost, class2} * TravelCost_{drivetotransit} $

$V_{PrivateTransit} = ASCPrivateTransit_{class2} + \beta_{tt, class2} * TravelTime_{privatetransit} + \beta_{cost, class2} * TravelCost_{privatetransit} $


In [8]:
# NOTE: Specification and variable names must be in lists of ordered dictionaries.
# class_specific_specs defines the variables names to be used in the specification of the 
# class specific choice model of each class.
# class_specific_labels defines the names associated with each of the variables which
# will be displayed in the output tables after estimation.

# NOTE: by specifying the word 'intercept' in the class_specific_specs, the code
# understands that this denotes an alternative specific constant.

class_specific_specs = [OrderedDict([('intercept', [2,4,5,6]), 
                                     ('travel_time', [[1,2,4,5,6]]),
                                     ('travel_cost', [[1,5],[4,6]])]),
                        OrderedDict([('intercept', [2,3,4,5,6]), 
                                     ('travel_time', [[1,2,3,4,5,6]]),
                                     ('travel_cost', [[1,4,5,6]])])]
                       
class_specific_labels = [OrderedDict([('ASC', ['ASC(Walk)',
                                               'ASC(Walk to Transit)','ASC(Drive to Transit)',
                                               'ASC(Private Transit)']),
                                      ('Travel Time',['Travel Time ']), 
                                      ('Travel Cost',['Travel Cost Auto and Drive to Transit', 'Travel Cost WalktoTransit and PrivateTransit'])]),
                         OrderedDict([('ASC', ['ASC(Walk)','ASC(Bike)',
                                               'ASC(Walk to Transit)','ASC(Drive to Transit)',
                                               'ASC(Private Transit)']),
                                      ('Travel Time',['Travel Time']), 
                                      ('Travel Cost',['Travel Cost'])])]

Step 4: Accounting for Choice-based Sampling

The code by default assumes the sample is not choice-based, and hence all individual weights are set equal to one. However, if the sample is choice-based, the modeler can account for this by incorporating individual weights into the log-likelihood.

To do so, the user specifies a 1D numpy array whose size equals the number of individuals in the sample. Each element is the weight applied to the corresponding individual's log-likelihood to account for the choice-based sampling scheme, building off Ben-Akiva and Lerman (1983).
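As a minimal sketch (not part of the original notebook), the weights would be passed as a 1D array with one entry per individual; filling it with ones simply reproduces the default, non-choice-based behavior, while a choice-based sample would use the appropriate correction weights instead.

# One weight per individual; all ones replicates the default (non-choice-based) case.
n_individuals = df['ID'].nunique()
indWeights = np.ones(n_individuals)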

Step 5: Starting Values for Parameter Estimates

By default, the code does not require the user to specify starting values for the parameters of either the class membership model or the class-specific choice models; it generates random starting values automatically.

However, since this is a non-convex optimization problem with multiple local maxima, supplying starting values for the parameter estimates becomes increasingly important as the number of latent classes grows.


In [9]:
# Specify starting values for model parameters. Again this is optional and the modeler does
# not have to do so for estimation. 
# This section can be completely skipped.

# Class membership model parameters
# (one starting value per variable in class_membership_spec, for the non-base class)
paramClassMem = np.array([0,0,0,0,0,0])

# Class-specific choice model parameters: one array of starting values per latent class
paramClassSpec = []

# Latent class 1: 4 ASCs, 1 travel time coefficient, 2 travel cost coefficients
paramClassSpec.append(np.array([-2.14036027,-2.60680512,-2.86731413,-2.65139932,
                                -0.0000189449556,-0.0489097045,-0.0489097045]))

# Latent class 2: 5 ASCs, 1 travel time coefficient, 1 travel cost coefficient
paramClassSpec.append(np.array([1.353,-1.1648,1.0812,-1.9214,1.3328,
                                -1.2960,-0.0796]))

Step 6: Estimation of Latent Class Choice Model and Output Table

Estimation of the latent class choice model happens in the following cell, which takes the model specification together with, as needed, starting values for the parameter estimates, the choice set of each class, and choice-based sampling weights.

Following estimation, the model outputs the parameter estimates of the class membership and class-specific choice models along with their standard errors, t-statistics, and p-values. Statistical measures of fit such as rho-bar-squared, AIC, BIC, and the fitted log-likelihood are computed and displayed as well.


In [10]:
# Fit the model
# To better understand the inputs required by the lccm_fit function below,
# run help(lccm.lccm_fit), which identifies the required input variables.

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    
    lccm.lccm_fit(data = df,
                  ind_id_col = 'ID', 
                  obs_id_col = 'custom_id',
                  alt_id_col = 'mode_id',
                  choice_col = 'choice', 
                  n_classes = n_classes,
                  class_membership_spec = class_membership_spec,
                  class_membership_labels = class_membership_labels,
                  class_specific_specs = class_specific_specs,
                  class_specific_labels = class_specific_labels,
                  avail_alts = avail_alts,
                  #indWeights = indWeights,
                  outputFilePath = inputFilePath,
                  paramClassMem = paramClassMem,
                  paramClassSpec = paramClassSpec)


Processing data
Initializing EM Algorithm...

<Tue, 25 Apr 2017 14:37:34> Iteration 0: -625.7786
<Tue, 25 Apr 2017 14:37:35> Iteration 1: -583.4199
<Tue, 25 Apr 2017 14:37:35> Iteration 2: -581.0551
<Tue, 25 Apr 2017 14:37:35> Iteration 3: -580.3797
<Tue, 25 Apr 2017 14:37:35> Iteration 4: -580.1290
<Tue, 25 Apr 2017 14:37:35> Iteration 5: -580.0112
<Tue, 25 Apr 2017 14:37:35> Iteration 6: -579.9466
<Tue, 25 Apr 2017 14:37:35> Iteration 7: -579.9084
<Tue, 25 Apr 2017 14:37:35> Iteration 8: -579.8851
<Tue, 25 Apr 2017 14:37:35> Iteration 9: -579.8710
<Tue, 25 Apr 2017 14:37:35> Iteration 10: -579.8625
<Tue, 25 Apr 2017 14:37:35> Iteration 11: -579.8574
<Tue, 25 Apr 2017 14:37:35> Iteration 12: -579.8545
<Tue, 25 Apr 2017 14:37:35> Iteration 13: -579.8528
<Tue, 25 Apr 2017 14:37:35> Iteration 14: -579.8518
<Tue, 25 Apr 2017 14:37:35> Iteration 15: -579.8513
<Tue, 25 Apr 2017 14:37:35> Iteration 16: -579.8510
<Tue, 25 Apr 2017 14:37:35> Iteration 17: -579.8508
<Tue, 25 Apr 2017 14:37:35> Iteration 18: -579.8507

Enumerating choices for the sample


Number of Parameters:                                 20
Number of Observations:                              580
Null Log-Likelihood:                             -983.93
Fitted Log-Likelihood:                           -579.85
Rho-Squared:                                        0.41
Rho-Bar-Squared:                                    0.39
AIC:                                              1199.7
BIC:                                              1287.0
Estimation time (minutes):                          0.02



Class 1 Model: 
-----------------------------------------------------------------------------------------
Variables                                     parameters    std_err     t_stat    p_value
-----------------------------------------------------------------------------------------
ASC(Walk)                                        -2.9588     0.3014    -9.8165     0.0000
ASC(Walk to Transit)                             -3.2218     0.4639    -6.9454     0.0000
ASC(Drive to Transit)                            -3.4901     0.4844    -7.2043     0.0000
ASC(Private Transit)                             -3.8271     0.5568    -6.8735     0.0000
Travel Time                                       0.0000     0.0000     0.5212     0.6022
Travel Cost Auto and Drive to Transit            -0.0224     0.0130    -1.7261     0.0843
Travel Cost WalktoTransit and PrivateTransit     -0.0041     0.0123    -0.3305     0.7410
-----------------------------------------------------------------------------------------

Class 2 Model: 
-----------------------------------------------------------------------------------------
Variables                                     parameters    std_err     t_stat    p_value
-----------------------------------------------------------------------------------------
ASC(Walk)                                         1.3272     0.1861     7.1319     0.0000
ASC(Bike)                                        -1.8951     0.3255    -5.8228     0.0000
ASC(Walk to Transit)                              0.6514     0.2643     2.4651     0.0137
ASC(Drive to Transit)                            -2.6710     0.7150    -3.7357     0.0002
ASC(Private Transit)                              1.4047     0.3550     3.9566     0.0001
Travel Time                                      -2.1417     0.2422    -8.8429     0.0000
Travel Cost                                      -0.1579     0.0277    -5.7124     0.0000
-----------------------------------------------------------------------------------------

Class Membership Model:
-----------------------------------------------------------------------------------------
Variables                                     parameters    std_err     t_stat    p_value
-----------------------------------------------------------------------------------------
Class-specific constant (Class 2)                 0.6177     0.4413     1.3999     0.1616
Car ownership (Class 2)                          -0.3319     0.1679    -1.9765     0.0481
Low Income (Class 2)                              0.6386     0.9238     0.6913     0.4894
High Income (Class 2)                             0.6278     0.4300     1.4600     0.1443
Male (Class 2)                                   -0.7279     0.4117    -1.7680     0.0771
Distance Traveled (miles) (Class 2)              -0.0264     0.0191    -1.3818     0.1670
-----------------------------------------------------------------------------------------

Step 7: Sample Enumeration

Finally, the model generates a csv file (ModelResultsSampleEnum.csv) in your working directory containing, for each person, the probability of belonging to each class and the probability of choosing each of the available alternatives within every latent class. From this file you can generate graphs of class membership percentages and mode choice shares across the latent classes.

The output file (ModelResultsSampleEnum.csv) contains the following, stacked column-wise: 1) the index of each individual in the sample; 2) the probability of each individual belonging to each class; 3) the probability of choosing each alternative from the full choice set under each latent class. The choice probabilities are enumerated across all alternatives within a given class and then repeated for each latent class.

For an example, refer to the Excel file example sample enumeration.xlsx. It shows in more detail how to interpret the sample enumeration results and what kinds of plots and measures you can produce. Feel free to perform the analysis in R or Python if Excel is not your preferred tool; the workbook was built by simply copying all values from ModelResultsSampleEnum.csv and performing the required analysis there.
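For instance, a rough Python sketch along those lines might look as follows. It assumes the column order described above and that the file is written without a header row; check example sample enumeration.xlsx and adjust the header handling if the actual layout differs.

# Read the sample enumeration output produced by the model above
# (assumption: no header row and the column order described in Step 7).
enum = pd.read_csv(inputFilePath + 'ModelResultsSampleEnum.csv', header=None)

# Assumed layout: column 0 is the individual index, the next n_classes columns hold
# the class membership probabilities, and the remaining columns hold the choice
# probabilities of each alternative within each class.
class_membership_probs = enum.iloc[:, 1:1 + n_classes]

# Average membership probability per class, i.e. the estimated class shares in the sample
print(class_membership_probs.mean())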